This
lecture is about
paradigmatic relation discovery.
In this lecture we are going to talk about
how to discover a particular kind of word
association called
a paradigmatical relation.
By definition,
two words are paradigmatically
related if they share a similar context.
Namely, they occur in
similar positions in text.
So naturally our idea of discovering such
a relation is to look at the context
of each word and then try to compute
the similarity of those contexts.
So here is an example of
the context of a word, cat.
Here I have taken the word
cat out of the context and
you can see we are seeing some remaining
words in the sentences that contain cat.
Now, we can do the same thing for
another word like dog.
So in general we would like to capture
such a context and then try to assess
the similarity of the context of cat and
the context of a word like dog.
So now the question is how can we
formally represent the context and
then define the similarity function.
So first, we note that the context
actually contains a lot of words.
So, they can be regarded as
a pseudo document, an imaginary
document, but there are also different
ways of looking at the context.
For example, we can look at the word
that occurs before the word cat.
We can call this context Left1 context.
All right, so in this case you
will see words like my, his, or
big, a, the, et cetera.
These are the words that can
occur to the left of the word cat.
So we say my cat, his cat,
big cat, a cat, et cetera.
Similarly, we can also collect the words
that occur right after the word cat.
We can call this context Right1, and
here we see words like eats,
ate, is, has, et cetera.
Or, more generally,
we can look at all the words in
the window of text around the word cat.
Here, let's say we can take a window
of 8 words around the word cat.
We call this context Window8.
Now, of course, you can see all
the words from the left or from the right, and
so we'll have a bag of words in
general to represent the context.
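To make these three views concrete, here is a minimal sketch of how one might collect them. The function name `contexts`, the whitespace tokenization, the two toy sentences, and the reading of Window8 as up to 8 words on each side are all assumptions made for illustration, not something fixed by the lecture.

```python
from collections import Counter

def contexts(sentences, target, window=8):
    """Collect Left1, Right1, and window bags of words for `target`.

    Assumption: `window` means up to that many words on EACH side.
    """
    left1, right1, win = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            if i > 0:
                left1[tokens[i - 1]] += 1       # word right before target
            if i + 1 < len(tokens):
                right1[tokens[i + 1]] += 1      # word right after target
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:                      # leave the target word out
                    win[tokens[j]] += 1
    return left1, right1, win

# Toy sentences, made up for illustration.
toy = ["my cat eats fish on saturday", "his cat eats turkey on tuesday"]
left1, right1, window8 = contexts(toy, "cat")
print(left1, right1, window8)
```

Each returned `Counter` is one bag of words, so the same target word gets a different pseudo document depending on which view you choose.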
Now, such a word-based representation
would actually give us
an interesting way to define the
perspective from which we measure similarity.
Because if you look at just
the similarity of Left1,
then we'll see words that share
just the words in the left context,
and we kind of ignore the other words
that are also in the general context.
So that gives us one perspective to
measure the similarity, and similarly,
if we only use the Right1 context,
we will capture the similarity
from another perspective.
Using both the Left1 and
Right1 of course would allow us to capture
the similarity with even
stricter criteria.
So in general, context may contain
adjacent words, like eats and
my, that you see here, or
non-adjacent words, like Saturday,
Tuesday, or
some other words in the context.
And this flexibility also allows us
to match the similarity in somewhat
different ways.
Sometimes this is useful,
as we might want to capture
similarity based on general content.
That would give us loosely
related paradigmatical relations.
Whereas if you use only the words
immediately to the left and
to the right of the word, then you
likely will capture words that are very
much related by their syntactical
categories and semantics.
So the general idea of discovering
paradigmatical relations
is to compute the similarity
of the contexts of two words.
So here, for example,
we can measure the similarity of cat and
dog based on the similarity
of their contexts.
In general, we can combine all
kinds of views of the context.
And so the similarity function is,
in general,
a combination of similarities
on different contexts.
And of course, we can also assign
weights to these different
similarities to allow us to focus
more on a particular kind of context.
And this would naturally be
application specific, but again,
here the main idea for discovering
paradigmatically related words is
to compute the similarity
of their contexts.
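The weighted combination can be sketched in a few lines. The lecture has not yet defined a per-view similarity at this point, so the sketch below plugs in Jaccard overlap of word sets purely as a stand-in; the names `jaccard` and `combined_similarity`, the toy word sets, and the weights are hypothetical.

```python
def jaccard(a, b):
    """Stand-in per-view similarity (an assumption, not the lecture's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_similarity(views_a, views_b, weights, sim=jaccard):
    """Weighted sum of per-view similarities, one term per context view."""
    return sum(
        weight * sim(views_a[name], views_b[name])
        for name, weight in weights.items()
    )

# Toy context views, made up for illustration.
cat_views = {"left1": {"my", "his", "big"}, "right1": {"eats", "is"}}
dog_views = {"left1": {"my", "the", "big"}, "right1": {"eats", "has"}}
weights = {"left1": 0.5, "right1": 0.5}  # application-specific choice

score = combined_similarity(cat_views, dog_views, weights)
print(round(score, 4))
```

Raising the weight of one view is how you would focus the measure on that particular kind of context.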
So next let's see how we exactly
compute these similarity functions.
Now to answer this question,
it is useful to think of the bag-of-words
representation as vectors
in a vector space model.
Now those of you who are
familiar with information retrieval or
text retrieval techniques would
realize that the vector space model has
been used frequently for
modeling documents and queries for search.
But here we also find it convenient
to model the context of a word for
paradigmatic relation discovery.
So the idea of this
approach is to view each
word in our vocabulary as defining one
dimension in a high dimensional space.
So if we have N words in
total in the vocabulary,
then we have N dimensions,
as illustrated here.
And on the bottom, you can see a frequency
vector representing a context,
and here we see that eats
occurred 5 times in this context,
ate occurred 3 times, et cetera.
So this vector can then be placed
in this vector space model.
So in general,
we can represent a pseudo document or
context of cat as one vector,
d1, and another word,
dog, might give us a different context,
so d2.
And then we can measure
the similarity of these two vectors.
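As a small sketch of this representation, the code below turns two bags of words into count vectors over a shared vocabulary. The counts for eats and ate in the first context follow the example just mentioned; every other count, the second context, and the helper name `to_vector` are made up for illustration.

```python
from collections import Counter

def to_vector(context_counts, vocabulary):
    """One dimension per vocabulary word; the value is the word's count."""
    return [context_counts.get(word, 0) for word in vocabulary]

# "eats": 5 and "ate": 3 follow the lecture's example; the rest is invented.
cat_context = Counter({"eats": 5, "ate": 3, "is": 2})
dog_context = Counter({"eats": 3, "ate": 1, "has": 2})

# The vocabulary fixes the order of the N dimensions.
vocabulary = sorted(set(cat_context) | set(dog_context))

d1 = to_vector(cat_context, vocabulary)
d2 = to_vector(dog_context, vocabulary)
print(vocabulary)
print(d1)
print(d2)
```

Words that never occur in a context simply get a zero in that dimension, which is why these vectors are very sparse for a realistic vocabulary.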
So by viewing context in
the vector space model,
we convert the problem of
paradigmatical relation discovery
into the problem of computing
the vectors and their similarity.
So the two questions that we
have to address are first,
how to compute each vector, and
that is how to compute xi or yi.
And the other question is how
do you compute the similarity.
Now in general, there are many approaches
that can be used to solve the problem, and
most of them were developed for
information retrieval.
And they have been shown to work well for
matching a query vector and
a document vector.
But we can adapt many of
the ideas to compute the similarity
of context documents for our purpose here.
So let's first look at
one plausible approach,
where we try to match
the similarity of contexts based on
the expected overlap of words in context,
and we call this EOWC.
So the idea here is to represent
a context by a word vector
where each word has a weight
that's equal to the probability
that a randomly picked word from
this document vector is this word.
So in other words,
xi is defined as the normalized
count of word wi in the context, and
this can be interpreted as
the probability that you would
actually pick this word from d1
if you randomly picked a word.
Now, of course these xi's would sum to one
because they are normalized frequencies,
and this means the vector is
actually a probability
distribution over words.
So, the vector d2 can be also
computed in the same way, and
this would give us then two probability
distributions representing two contexts.
So, that addresses the problem
of how to compute the vectors, and
next let's see how we can define
similarity in this approach.
Well, here, we simply define
the similarity as a dot product of two
vectors, and
this is defined as a sum of the products
of the corresponding
elements of the two vectors.
Now, it's interesting to see
that this similarity function
actually has a nice interpretation,
and that is this.
The dot product, in fact, gives
us the probability that two
randomly picked words from
the two contexts are identical.
That means if we try to pick a word
from one context and try to pick another
word from another context, we can then
ask the question, are they identical?
If the two contexts are very similar,
then we should expect we frequently will
see the two words picked from
the two contexts are identical.
If they are very different,
then the chance of seeing
identical words being picked from
the two contexts would be small.
So this intuitively makes sense, right,
for measuring similarity of contexts.
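Here is a minimal sketch of EOWC as just described: normalize each context's counts into a probability distribution, then take the dot product. The function names `normalize` and `eowc` and the toy counts are assumptions for illustration.

```python
def normalize(counts):
    """Turn raw counts into probabilities that sum to one."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

def eowc(counts1, counts2):
    """Dot product of the two normalized context vectors.

    Words missing from either context contribute zero, so only
    shared words add to the score.
    """
    p1, p2 = normalize(counts1), normalize(counts2)
    return sum(prob * p2.get(word, 0.0) for word, prob in p1.items())

# Toy counts, made up for illustration.
cat_context = {"eats": 5, "ate": 3, "is": 2}   # 10 words in total
dog_context = {"eats": 3, "ate": 1, "has": 2}  # 6 words in total

# eats: 0.5 * 0.5, ate: 0.3 * (1/6); nothing else is shared.
print(round(eowc(cat_context, dog_context), 4))  # 0.3
```

The score is the chance that one word drawn at random from each context turns out to be the same word.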
Now you might want to also take
a look at the exact formulas and
see why this can be interpreted
as the probability that
two randomly picked words are identical.
So if you just stare at the formula
to check what's inside this sum,
then you will see basically in each
case it gives us the probability that
we will see an overlap on
a particular word, wi.
Here xi gives us the probability that
we will pick this particular word from d1,
and yi gives us the probability
of picking this word from d2.
And when we pick the same
word from the two contexts,
then we have an identical pick.
So that's one possible approach, EOWC,
expected overlap of words in context.
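Written out, the definition and its interpretation look roughly as follows. The notation c(w_i, d) for the count of word w_i in context d and |d| for the total number of words in d is an assumption here, and the two picks are assumed to be independent.

```latex
\[
x_i = \frac{c(w_i, d_1)}{|d_1|}, \qquad
y_i = \frac{c(w_i, d_2)}{|d_2|}, \qquad
\sum_{i=1}^{N} x_i = \sum_{i=1}^{N} y_i = 1
\]
\[
\mathrm{EOWC}(d_1, d_2) = d_1 \cdot d_2
  = \sum_{i=1}^{N} x_i \, y_i
  = \sum_{i=1}^{N} P(\text{pick } w_i \text{ from } d_1)\,
                   P(\text{pick } w_i \text{ from } d_2)
\]
```

Each term of the sum is the probability that both picks land on the same word w_i, and summing over all N words gives the probability that the two picks are identical.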
Now as always, we would like to assess
whether this approach would work well.
Now of course, ultimately we have to
test the approach with real data and
see if it really gives us
semantically related words,
if it really gives us paradigmatical relations.
But
analytically we can also analyze
this formula a little bit.
So first, as I said,
it does make sense, right, because this
formula will give a higher score if there
is more overlap between the two contexts.
So that's exactly what we want.
But if you analyze
the formula more carefully,
then you also see there might
be some potential problems,
and specifically there
are two potential problems.
First, it might favor matching
one frequent term very well,
over matching more distinct terms.
And that is because in the dot product,
if one element has a high value and this
element is shared by both contexts,
it contributes a lot to the overall sum.
It might indeed make the score
higher than in another case,
where the two vectors actually have
a lot of overlap in different terms
but each term has a relatively low
frequency. So this may not be desirable.
Of course, this might be
desirable in some other cases.
But in our case, we should intuitively
prefer a case where we match
more different terms in the context,
so that we have more confidence
in saying that the two words
indeed occur in similar context.
If you only rely on one term,
that's a little bit questionable;
it may not be robust.
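A small numeric sketch makes this first problem visible. The two pairs of contexts below are invented for illustration, and `eowc` is a hypothetical helper that computes the dot product of normalized counts.

```python
def eowc(counts1, counts2):
    """Dot product of the two contexts' normalized count vectors."""
    total1, total2 = sum(counts1.values()), sum(counts2.values())
    return sum(
        (count / total1) * (counts2.get(word, 0) / total2)
        for word, count in counts1.items()
    )

# Pair A: the contexts share only ONE word, but it is very frequent.
a1 = {"the": 9, "eats": 1}
a2 = {"the": 9, "meows": 1}

# Pair B: the contexts share TEN distinct words, each with low frequency.
shared = {f"w{i}": 1 for i in range(10)}
b1, b2 = dict(shared), dict(shared)

score_a = eowc(a1, a2)  # 0.9 * 0.9
score_b = eowc(b1, b2)  # 10 * (0.1 * 0.1)
print(round(score_a, 2), round(score_b, 2))  # 0.81 0.1
```

Pair B matches on every one of its ten words, yet pair A scores about eight times higher on the strength of a single frequent word.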
Now the second problem is that it
treats every word equally, right.
So if you match a word like the,
it will be the same as
matching a word like eats, but
intuitively we know
matching the isn't really
surprising because the occurs everywhere.
So matching the is not as
strong evidence as matching
a word like eats,
which doesn't occur frequently.
So this is another
problem of this approach.
In the next chapter we are going to talk
about how to address these problems.